Ephemeral Document Clustering for Web Applications
Authors
Abstract
We revisit document clustering in the context of the Web. Specifically, we investigate on-line ephemeral clustering, whereby the input document set is generated dynamically, typically by search results, and the output clustering hierarchy has a short life span and is used for interactive browsing purposes. Ephemeral clustering for interactive use introduces several new challenges. It requires an efficient algorithm, since clustering is performed on-line. It also requires high precision, because users who are not domain experts are less tolerant of errors, and because the resulting hierarchy is generated fully automatically, as opposed to off-line clustering, in which the hierarchy is often manually modified. Finally, interactive clustering requires a presentation layer that enables users to effectively browse the hierarchy, including visualization techniques and automatic annotations of the hierarchy. We present new concepts, techniques and algorithms that tailor clustering to these requirements. More specifically, we improve precision by feeding the clustering algorithm with more precise profiles using "lexical affinities" as indexing units. We give an optimal complete-link Hierarchical Agglomerative Clustering algorithm, with O(n^2) time complexity. Finally, we provide a visual yet light Java-based implementation of the presentation layer. We demonstrate the utility of ephemeral clustering through a search assistance utility that embodies these ideas.

Keywords: information retrieval, hierarchical clustering, document clustering

IBM Research Report RJ 10186, April 2000.

Author affiliations:
IBM Research in Haifa, MATAM, Haifa 31905, ISRAEL.
IBM Almaden Research Center, 650 Harry Road, San Jose, California, USA.
Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa 32000, ISRAEL.
Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, USA.
Similar resources
Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics
This paper discusses the future of World Wide Web development, known as the Semantic Web. Undoubtedly, the Web is one of the most important services on the Internet and has had the greatest impact on the spread of the Internet in human societies. Internet penetration has been an effective factor in the growth of the volume of information on the Web. The massive growth of informat...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use...
A density based clustering approach to distinguish between web robot and human requests to a web server
Today, the world's dependence on the Internet and the emergence of Web 2.0 applications are significantly increasing the need for web robots that crawl sites to support services and technologies. Regardless of the advantages of robots, they may occupy bandwidth and reduce the performance of web servers. Despite a variety of research efforts, there is no accurate method for classifying huge data ...
Using Text-Based Web Image Search Results Clustering to Minimize Mobile Devices Wasted Space-Interface
The recent shift in human-computer interaction from desktop to mobile computing fosters the need for new interfaces for exploring web image search results. In order to leverage users' efforts, we present a set of state-of-the-art ephemeral clustering algorithms, which make it possible to summarize web image search results into meaningful clusters. This way of presenting visual information on mobile devi...
A word-based soft clustering algorithm for documents
Document clustering is an important tool for applications such as Web search engines. It enables the user to have a good overall view of the information contained in the documents. However, existing algorithms suffer from various shortcomings; hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorit...